Exploring Red Wine Quality by Aakarsh Goel

Overview Of the Dataset

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). In this exercise, I explore which of these chemical properties influence the quality of red wines.

Compact structure of the Dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Summaries of all variables in the Dataset

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The summary shows the mean, median and quartiles of each variable. The median quality is 6 and the mean is 5.636, so the two are quite close.

Table of the counts of wines at each quality level.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

It shows that there are 6 distinct quality scores in this data set, ranging from 3 to 8, and that most wines are rated 5 or 6.
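As a quick cross-check, the quality column can be rebuilt from the counts in the table alone, which is enough to recompute the mean and median reported in the summary. This is only a sketch: the vector names `counts` and `quality` are chosen for illustration, and the row order of the real data is not reproduced.

```r
# Counts per quality score, copied from the table above.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild the quality column (row order does not matter for these statistics).
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)              # 1599 wines in total
length(unique(quality))      # 6 distinct scores
round(mean(quality), 3)      # 5.636
median(quality)              # 6
mean(quality %in% c(5, 6))   # share of wines rated 5 or 6, about 0.82
```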

Factoring quality

##  Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
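A minimal sketch of this step, using only the first ten quality values shown in the structure listing. The levels are given explicitly here because the ten-value sample does not contain every score; on the full data, where all six scores occur, a plain `factor(quality)` would presumably give the same six levels.

```r
# First ten quality values, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Explicit levels 3..8 so the sample gets the same six levels as the full data.
quality.factor <- factor(quality, levels = 3:8)

str(quality.factor)
as.integer(quality.factor)   # internal codes: 3 3 3 4 3 3 3 5 5 3
```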

Univariate Plots

Creating Histograms for all the 12 variables.

The histograms show that density, pH and quality have similar, roughly normal distributions. The others are less regular: some are right-skewed, with most values bunched at the low end, and some have outliers, mainly the sulphur-related variables, chlorides and residual sugar. Citric acid contains many zero values.

New Features

  1. Total Acidity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.270   7.827   8.720   9.118  10.070  17.050

Fixed acidity, volatile acidity and citric acid all describe the acid content of the wine, so I created a new variable, total_acidity, as the sum of the three.
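The sum can be sketched on the first five rows from the structure listing. The data frame name `wine` is an assumption, since the original object name is not shown; the results agree with the total_acidity values that appear in the later structure listing.

```r
# Hypothetical data frame `wine`: first five rows of the three acid columns.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00)
)

# New feature: element-wise sum of the three acid measurements.
wine$total_acidity <- wine$fixed.acidity + wine$volatile.acidity + wine$citric.acid

wine$total_acidity   # 8.10 8.68 8.60 12.04 8.10
```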

  2. Quality Review
##  low  avg high 
##   63 1319  217

The quality variable takes only discrete values from 3 to 8. The majority of the wines got ratings of 5 or 6, and very few got 3, 4, or 8. So I grouped quality into a new variable, review, with levels ‘low’ (quality 0 to 4), ‘avg’ (quality 5 or 6), and ‘high’ (quality 7 to 10).
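One way to build such a grouping is `cut()` with an ordered result. The sketch below is not necessarily how the original variable was created; it rebuilds the quality column from the counts in the contingency table and checks that the three bins reproduce the counts shown above.

```r
# Rebuild quality from the contingency table counts (row order is irrelevant here).
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Bins: [0, 4] -> low, (4, 6] -> avg, (6, 10] -> high.
review <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("low", "avg", "high"),
              include.lowest = TRUE,
              ordered_result = TRUE)

table(review)   # low 63, avg 1319, high 217
```

The ordered factor keeps the natural ranking low < avg < high, which matters for plot legends and for comparisons between levels.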

Boxplots for all 12 variables with 2 new features

The boxplots confirm the results from the histograms and show how frequent outliers are in each variable. Residual sugar, chlorides and sulphates tend to have many outliers.

Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    1599 obs. of  16 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality.factor      : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  $ total_acidity       : num  8.1 8.68 8.6 12.04 8.1 ...
##  $ review              : Ord.factor w/ 3 levels "low"<"avg"<"high": 2 2 2 2 2 2 2 3 3 2 ...

What is/are the main feature(s) of interest in your dataset?

The main features of interest are ‘quality’ and ‘review’, as the main focus is to analyze how wine quality and its review are affected by the other factors. Quality also shows a fairly normal distribution, with the bulk of the observations in the 5-6 range.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It's difficult to tell from univariate analysis alone, but density, pH and total_acidity may help, since their distributions have a similar shape to that of quality.

Did you create any new variables from existing variables in the dataset?

Yes, I created two new variables, ‘review’ and ‘total_acidity’, which I explained above in the New Features section.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of citric acid is fairly unusual, given that the distributions of fixed acidity and volatile acidity on a logarithmic scale look normal, like that of pH. Citric acid has a large number of zero values, which could reflect incomplete or unavailable data.

On the logarithmic scale, 132 citric acid values are removed, which is consistent with zeros having no finite logarithm. The dataset in general was fairly tidy, so additional auditing and cleaning was not needed. Some outliers are present, but they can be handled in later analysis without any problem.
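The sketch below illustrates why values drop out under a log scale, assuming the removed values are wines with citric.acid exactly 0 rather than missing (NA) entries. That assumption fits the summary above, which reports a minimum of 0.000 and no NA count. Only the first ten values from the structure listing are used.

```r
# First ten citric.acid values, copied from the str() output.
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

sum(is.na(citric.acid))        # 0: nothing is missing in this sample
sum(citric.acid == 0)          # 5 exact zeros

# log10(0) is -Inf, so these points cannot be drawn on a log10 axis.
log_citric <- log10(citric.acid)
sum(!is.finite(log_citric))    # 5: the same rows that were zero
```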

Bivariate Plots Section

Making scatter plot of some interesting variables in dataset

The bivariate plots began with a scatterplot matrix. Unfortunately, due to the large file size, generating such a plot took much too long. Instead, a sample of the dataset was used to begin the exploration. Still, the plot was very cluttered, and it was difficult to deduce any result from it.

Making Bivariate Boxplots for each feature with review.

From exploring these boxplots, it seems that a high quality red wine generally has these properties:

  • higher fixed acidity (tartaric acid) and citric acid, lower volatile acidity (acetic acid)
  • lower pH (i.e. more acidic)
  • higher sulphates
  • higher alcohol
  • to a lesser extent, lower chlorides and lower density

Finding Correlation with quality

##        fixed.acidity     volatile.acidity          citric.acid 
##           0.12405165          -0.39055778           0.22637251 
##        total_acidity       residual.sugar            chlorides 
##           0.10375373           0.01373164          -0.12890656 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##          -0.05065606          -0.18510029          -0.17491923 
##                   pH            sulphates              alcohol 
##          -0.05773139           0.25139708           0.47616632

Quantitatively, it appears that the following variables have relatively higher correlations to wine quality:

  • alcohol(0.47616632)
  • sulphates(0.25139708)
  • volatile acidity(-0.39055778)
  • citric acid(0.22637251)
  • total.sulfur.dioxide (-0.18510029)
  • density(-0.17491923)
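A table like the one above can be produced by correlating each predictor with quality and sorting by absolute value. This is a minimal sketch, assuming Pearson correlation (the default of `cor()`) and a hypothetical data frame `wine`. It holds only the first ten rows of three columns from the structure listing, so its numbers will not match the full-data correlations.

```r
# Hypothetical data frame `wine`: first ten rows, three predictors plus quality.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation of every predictor with quality.
predictors <- setdiff(names(wine), "quality")
cors <- sapply(wine[predictors], function(x) cor(x, wine$quality))

# Strongest relationships first, regardless of sign.
cors[order(abs(cors), decreasing = TRUE)]
```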

Plotting these features more with quality

All the above plots support the correlations, showing how each variable increases or decreases with quality.

Plotting relationships between the variables having high correlation with quality.

These scatterplots show that alcohol, sulphates, citric acid and volatile.acidity are the factors most strongly correlated with quality. Alcohol, citric.acid and sulphates act in a positive direction and volatile.acidity in a negative one, so a balance between these factors is necessary for the best wine quality.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

From the boxplots, it appears that fixed acidity, citric acid, sulphates and alcohol are positively correlated with wine quality, and volatile acidity and pH are negatively correlated. The correlation tests showed similar trends, except that pH had a correlation of only about -0.058, while sulphates had a stronger correlation of about 0.251. Quality doesn’t depend much on density.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The logarithmic relationship between acidity and pH was observed.

##        cor 
## -0.7044435

This supports the relationship between acidity and pH: pH is a logarithmic scale, so pH falls as the logarithm of acidity rises.

Alcohol and volatile.acidity are also correlated, though only weakly.
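A correlation of this kind can be computed with `cor.test()`, whose `estimate` component prints under the name `cor`, as in the output above. The original does not show which acidity variable was used, so the sketch below assumes fixed.acidity. It uses only the first ten rows from the structure listing, so the value will differ from -0.7044435.

```r
# First ten rows of fixed.acidity and pH, copied from the str() output.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# Pearson correlation between log10(acidity) and pH.
# Assumption: the original may have used a different acidity variable.
result <- cor.test(log10(fixed.acidity), pH)

result$estimate   # negative: more acid goes with lower pH
```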

##       cor 
## -0.202288

Citric Acid and volatile.acidity correlations

##        cor 
## -0.5524957

What was the strongest relationship you found?

The strongest relationship was between alcohol and quality, with a correlation of 0.47616632, which suggests that quality tends to improve with alcohol content. Quality declines as volatile acidity increases, with a correlation of -0.39055778.

Sulphates (0.25139708), citric acid (0.22637251), total.sulfur.dioxide (-0.18510029) and density (-0.17491923) are also related to quality, with the correlations shown.

Multivariate Plots Section

Let’s see how these variables compare, plotted against each other and faceted by wine rating and coloured by wine quality.

This plot shows that alcohol is more strongly positively correlated with quality than sulphates are, but higher levels of both factors still go with better wine quality.

It shows that high-quality wines occur when alcohol is high and volatile acidity is low. For average-quality wines, the two factors are balanced.

Both alcohol and citric acid are positively correlated with quality.

Sulphates don’t show much correlation with volatile acidity. Volatile acidity lowers the quality of wine more prominently than sulphates raise it.

Citric acid and sulphates together go with higher quality of the red wine, but citric acid has more effect than sulphates.

The correlation between citric acid and volatile acidity is -0.55, and the two factors affect quality in opposite directions, positive and negative respectively.

The two features I found to affect wine quality most are alcohol and volatile acidity, so let’s plot them against each other for the extreme wine reviews only, i.e. ‘low’ and ‘high’.

The plot clearly shows that the quality review is high when alcohol content is high and low when volatile acidity is high.

Alcohol and sulphates together also have a strong positive effect on the quality review, so let’s visualize how they affect the quality extremes together.

The plot above shows that high levels of both alcohol and sulphates go with a high quality review, and vice versa.

The plots above show that volatile acidity is negatively correlated with quality and with the other, positive factors. Volatile acidity not only lowers the quality of the wine but is also negatively related to citric acid, alcohol, sulphates and other positive features.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

For the multivariate plots, the features that bore the strongest relationship to quality were examined by colouring the plots by quality score and faceting them by the three review categories. The result was that higher alcohol, sulphates, citric acid and fixed acidity, and lower volatile acidity, lead to better wine quality. This matches the analysis made so far.

Were there any interesting or surprising interactions between features?

Since alcohol, specifically ethanol, is a weak acid, it was thought to be somewhat correlated with the presence of other acids, such as citric acid. The plot of alcohol against citric acid in the Multivariate Plots section clearly shows their lack of correlation with each other.

Also, pH and total_acidity show little effect on quality in the visualisations. The pH range is small, between about 3 and 4, so pH has little influence on quality, and because total_acidity is correlated with pH, it doesn’t affect the result much either.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

No, I didn’t create any model.

Final Plots and Summary

Plot 1: Effect of different acids on wine quality

Description One

These plots were created to demonstrate the effect of acidity on wine quality. Generally, higher acidity (or lower pH) is seen in highly-rated wines. One caveat is that the presence of volatile (acetic) acid negatively affected wine quality. Citric acid had a relatively high correlation with wine quality, while fixed (tartaric) acid and total_acidity had a smaller impact.

Plot 2: Effect of Alcohol and volatile Acidity on Wine extreme qualities

Description Two

This is perhaps the most descriptive visualisation. I subsetted the data to remove the ‘average’ quality wines, i.e. any wine with a rating of 5 or 6. As the correlation tests show, wine quality was affected most strongly by alcohol and volatile acidity. The plot shows that high volatile acidity kept wine quality down and vice versa. A combination of high alcohol content and low volatile acidity produced better wines, with few outliers.
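The subsetting step can be sketched as follows. The data frame name `wine` and the `cut()` breaks are assumptions based on the review definition given earlier, and only the first ten rows from the structure listing are used, so just two wines survive the filter here.

```r
# Hypothetical data frame `wine`: first ten rows of three columns from str().
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Same grouping as the review variable: low (0-4), avg (5-6), high (7-10).
wine$review <- cut(wine$quality, breaks = c(0, 4, 6, 10),
                   labels = c("low", "avg", "high"),
                   include.lowest = TRUE, ordered_result = TRUE)

# Keep only the extreme reviews and drop the now-unused factor levels.
extremes <- droplevels(subset(wine, review != "avg"))

nrow(extremes)            # 2: only the two wines rated 7 remain in this sample
levels(extremes$review)   # "high"
```

Dropping unused levels keeps an empty ‘avg’ category from showing up in plot legends.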

Plot 3: Sulphates and alcohol on Wine quality ratings

Description Three

This is the most interesting and important visualisation: it shows that good wines have an abundance of both sulphates and alcohol. The dotted lines represent the mean along each axis, and the top right quadrant has a large density of ‘high’ wine ratings.

Reflection

Through this exploratory data analysis, I was able to identify the key factors that are correlated with red wine quality, i.e. alcohol, sulphates and acidity.

I faced difficulty plotting the ggpairs scatterplot matrix. It was very complicated, so I simplified it by plotting a limited set of variables.

I found that alcohol, citric acid and sulphates are positively correlated with quality, while volatile acidity alone has a strong negative correlation with quality.

I mainly used Scatterplots, Boxplots and histograms for exploratory visualization of this dataset. The final plots depict the relationship of acidity to a good wine, and most importantly, such a wine will likely contain high alcohol content, high sulphates and low volatile acidity.

The dataset should contain more information, such as the oxidation of the wine, which really affects its quality, because oxidation develops and adds aromatic complexity. As a result, the wines become more flavorful and earthy. In red wines, it softens the tannins and stabilizes color.

For future work, these factors could be used as features in a predictive model, built with machine learning algorithms, that predicts which quality review a wine with certain features should be given.